The description of variables as provided by Kaggle: https://www.kaggle.com/c/titanic
VARIABLE DESCRIPTIONS:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
In [1]:
# file location
file_path = "./titanic_data.csv"
In [2]:
# import pandas
import numpy as np
import pandas as pd
In [1]:
def get_dataframe(csv_file):
    '''Read a .csv file.
    Parameters:
    -----------
    csv_file : path to a file in csv format.
    Return:
    -----------
    a pandas dataframe
    '''
    return pd.read_csv(csv_file)
In [4]:
def show_info(df, show_all):
    '''Print information about a dataframe.
    Parameters:
    -----------
    df : a dataframe.
    show_all : a flag; if True print the whole data, if False print the first and last five rows
    '''
    separator = "-" * 77 + "\n"
    print("The information of dataframe\n")
    df.info()  # info() prints its report itself and returns None
    print(separator)
    if show_all:
        print("DataFrame content")
        print(df)
        print(separator)
    else:
        print("First five rows of DataFrame")
        print(df.head())
        print(separator)
        print("Last five rows of DataFrame")
        print(df.tail())
        print(separator)
    print("The basic descriptive statistics")
    print(df.describe())
    print(separator)
In [5]:
# open titanic dataset with pandas
titanic = get_dataframe(file_path)
titanic2 = get_dataframe(file_path)
In [6]:
# inspect the overall information of data
show_info(titanic, False)
1.) Who were the passengers on the Titanic?
2.) How old were the passengers?
3.) Where did the passengers come from?
4.) Who purchased the expensive tickets?
5.) Is passenger class related to cabin position?
6.) Which passengers travelled alone and which travelled with family?
7.) Which significant factors helped some passengers survive?
The dataset has three columns with missing data: "Age" is missing in 177 rows, "Cabin" in 687 rows and "Embarked" in 2 rows. Let's start by fixing the missing data.
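As a minimal sketch of how such per-column missing counts can be obtained, the snippet below counts null cells with `isnull().sum()`. The frame `toy` and its values are made up for illustration; they are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption), standing in for the real dataset.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Cabin': ['C85', None, None, None],
    'Embarked': ['S', 'C', None, 'S'],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Running the same expression on `titanic` should reproduce the counts quoted above, assuming the file is unchanged.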
In [7]:
## Find the total number of rows (passengers) in the dataframe.
def count_passenger(df):
    '''
    Find the total number of rows in a dataframe.
    Parameters:
    -----------
    df : a dataframe.
    return:
    -----------
    len(df) : number of rows
    '''
    return len(df)
In [8]:
def remove_column(df, key):
    '''
    Drop a key (or list of keys) from a dataframe and return a dataframe that excludes it
    Parameters:
    -----------
    df : a dataframe.
    key : a key (or list of keys) to remove from the dataframe
    return:
    -----------
    new_df : dataframe without key
    '''
    new_df = df.copy()
    return new_df.drop(key, axis=1)
In [9]:
def had_family(df):
    '''
    Check whether a passenger travels with family or alone
    Parameters:
    -----------
    df : number of relatives aboard (SibSp + Parch).
    return:
    -----------
    'With family' : if travelling with family
    'Alone' : if travelling alone
    '''
    if df > 0:
        return 'With family'
    else:
        return 'Alone'
In [10]:
def had_alive(df):
    '''
    Check whether a passenger survived or died
    Parameters:
    -----------
    df : value of the Survived column (0 or 1).
    return:
    -----------
    'alive' : if Survived = 1
    'died' : if Survived = 0
    '''
    if df > 0:
        return "alive"
    else:
        return "died"
In [11]:
def had_class(df):
    '''
    Return the passenger class as text
    Parameters:
    -----------
    df : value of the Pclass column (1, 2 or 3).
    return:
    -----------
    'First' : if Pclass = 1
    'Second' : if Pclass = 2
    'Third' : otherwise (Pclass = 3)
    '''
    if df == 1:
        return 'First'
    elif df == 2:
        return 'Second'
    else:
        return 'Third'
In [12]:
# First let's make a function to sort passengers into children and adults by sex
def had_child(passenger):
    '''
    Treat a passenger aged under 16 as a child
    Parameters:
    -----------
    passenger : an object holding (Age, Sex).
    return:
    -----------
    'Child' : if age <= 15
    sex : if age > 15
    '''
    # Take the Age and Sex
    age, sex = passenger
    if age <= 15:
        return 'Child'
    else:
        return sex
In [13]:
# find all the unique values for "Age"
print(titanic['Age'].unique())
In [14]:
# fill missing "Age" with the median
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# inspect dataset
#show_info(titanic, True)
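The cell above fills with the median rather than the mean. As a small illustration of why the two can differ, the sketch below uses a made-up age series (an assumption, not the real column) in which one large value pulls the mean upward.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries and one large value.
ages = pd.Series([2.0, 30.0, np.nan, 28.0, 80.0, np.nan])

median_age = ages.median()  # middle of the non-null values 2, 28, 30, 80
mean_age = ages.mean()      # average of the same four values

# Fill the gaps with the median, as the notebook cell does for "Age".
filled = ages.fillna(median_age)
print(median_age, mean_age)
print(filled.tolist())
```

Both statistics skip missing values by default, so the fill value is computed from the known ages only.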
In [15]:
# find all the unique values for "Embarked"
print(titanic['Embarked'].unique())
In [16]:
# the most common embarkation port is Southampton, so let's assume everyone got on there.
# replace all the missing values in the Embarked column with S.
titanic['Embarked'] = titanic['Embarked'].fillna('S')
In [17]:
# find all the unique values for "Cabin"
print(titanic['Cabin'].unique())
In [18]:
# "Cabin" shows 687 missing rows,
# assume passengers who held the same ticket number stayed in the same cabin
# loop through the rows of the dataframe and fill Cabin
for i, row in titanic.iterrows():
    if pd.isnull(row['Cabin']):
        continue
    # copy this cabin to the first passenger with the same ticket and no cabin
    for j, _ in titanic.iterrows():
        if pd.isnull(titanic.loc[j, 'Cabin']):
            if titanic.loc[j, 'Ticket'] == titanic.loc[i, 'Ticket']:
                titanic.loc[j, 'Cabin'] = titanic.loc[i, 'Cabin']
                break
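The nested loop above visits every pair of rows. A possible vectorised alternative, shown here only as a sketch on a made-up frame, takes the first known cabin per ticket with `groupby(...).transform('first')` and uses it to fill the gaps. Note this is a swapped-in technique: it fills every missing row of a ticket group at once, which is not exactly what the loop does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption): two passengers share ticket A1,
# two share ticket B2 with no known cabin, one travels on ticket C3.
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'B2', 'B2', 'C3'],
    'Cabin': ['C85', np.nan, np.nan, np.nan, 'E46'],
})

# First non-null cabin within each ticket group, aligned to the original rows.
known_cabin = toy.groupby('Ticket')['Cabin'].transform('first')

# Only rows with a missing cabin are changed; groups with no known cabin stay missing.
toy['Cabin'] = toy['Cabin'].fillna(known_cabin)
print(toy)
```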
In [19]:
titanic.head()
Out[19]:
In [20]:
# add "Passenger" column to represent who survived or died
titanic['Passenger'] = titanic['Survived'].map(had_alive)
In [21]:
# add "Gender" column that maps Sex from text to number for future analysis
titanic['Gender'] = titanic['Sex'].map({'female' : 0, 'male' : 1}).astype(int)
In [22]:
# add "Who" column to categorize passengers as male, female
# or child (age 15 or under)
titanic['Who'] = titanic[['Age','Sex']].apply(had_child, axis=1)
In [23]:
# add "Port" column that maps each port of embarkation from text to number for future analysis
titanic['Port'] = titanic['Embarked'].dropna().map({'C' : 0, 'Q' : 1, 'S': 2 }).astype(int)
In [24]:
# add "Class" column to represent the class of each passenger
titanic['Class'] = titanic['Pclass'].map(had_class)
In [25]:
# add "Family" column to represent passengers who travel alone or with their family
titanic['Family'] = (titanic['Parch'] + titanic['SibSp']).map(had_family)
# drop "Parch" & "SibSp"
titanic = remove_column(titanic, ['Parch', 'SibSp'])
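For a two-way label such as "Family", a vectorised form is also possible. The sketch below uses `numpy.where` on a made-up frame (an assumption) and yields the same labels as `had_family` without calling a Python function per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the two family-relation columns.
toy = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

# Total number of relatives aboard for each passenger.
relatives = toy['SibSp'] + toy['Parch']

# Label each row in one vectorised step.
toy['Family'] = np.where(relatives > 0, 'With family', 'Alone')
print(toy)
```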
In [26]:
titanic.head()
Out[26]:
In [27]:
# import library for the analysis and visualization
import math
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
from pandas.tools.plotting import scatter_matrix
sns.set(style="white")
%matplotlib inline
In [28]:
# get total passenger
passengercount = float(len(titanic))
In [29]:
# investigate who the passengers were
# show passenger count
passenger_by_who = titanic.groupby('Who')
print(passenger_by_who.count()['Passenger'])
# set plot title
sns.plt.title('Titanic passengers categorize')
# show the counts of passengers
ax = sns.countplot(x = 'Who', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 10, '%1.2f'%((height*100)/passengercount))
Most passengers were male, about double the number of females.
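The percentages drawn on the bars can also be computed directly as a cross-check. A minimal sketch, using a made-up category series rather than the real "Who" column:

```python
import pandas as pd

# Hypothetical categories (assumption), one entry per passenger.
who = pd.Series(['male', 'male', 'female', 'Child', 'male'])

# normalize=True returns fractions of the total; multiply by 100 for percent.
share = who.value_counts(normalize=True) * 100
print(share)
```

Applied to `titanic['Who']`, this should give the same figures as the bar annotations, up to rounding.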
In [30]:
# investigate who the passengers were, by class
# show passenger count
passenger_by_who_class = titanic.groupby(['Who','Class'])
print(passenger_by_who_class.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passengers by Class')
# show the counts of passengers
ax = sns.countplot(x = 'Who', hue = 'Class', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 10, '%1.2f'%((height*100)/passengercount))
In third class there were about twice as many males as females, and most children travelled in this class.
In [31]:
# show passenger age
passenger_by_age_who = titanic.groupby(['Who'])
print passenger_by_age_who['Age'].mean()
In [32]:
# quick look the distribution
ax = sns.distplot(titanic['Age'], hist = True)
# set plot title
ax = sns.plt.title('Titanic passengers by Age')
In [34]:
# check age for passenger by type
# faceplot by passenger type
ax = sns.FacetGrid(titanic, hue = 'Who', aspect = 3)
ax.map(sns.kdeplot,'Age',shade = True)
oldest = titanic['Age'].max()
ax.set(xlim = (0, oldest))
ax.add_legend();
# set title
ax = sns.plt.title('Titanic passengers Age by person')
Comparing by gender, the age distributions of male and female passengers do not differ much.
In [35]:
# check age for passenger by class
# faceplot by passenger type
ax = sns.FacetGrid(titanic, hue = 'Class', aspect = 3)
ax.map(sns.kdeplot,'Age',shade = True)
oldest = titanic['Age'].max()
ax.set(xlim=(0,oldest))
ax.add_legend();
# set title
ax = sns.plt.title('Titanic passengers Age by Class')
The age distributions do not look very different between passenger classes.
In [36]:
# map Embarked port character to full name of each town
titanic['Embarked'] = titanic['Embarked'].dropna().map({'C' : 'Cherbourg', 'Q' : 'Queenstown', 'S': 'Southampton' })
In [37]:
# check where the passengers came from
passenger_by_embark_who = titanic.groupby('Embarked')
print(passenger_by_embark_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Port of Embarkation')
# show the counts of passengers
ax = sns.countplot(x = 'Embarked', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 10, '%1.2f'%((height*100)/passengercount))
In [38]:
# check where the passengers came from, by passenger type
passenger_by_embark_who = titanic.groupby(['Embarked','Who'])
print(passenger_by_embark_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Port of Embarkation')
# show the counts of passengers
ax = sns.countplot(x = 'Embarked', hue = 'Who', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 10, '%1.2f'%((height*100)/passengercount))
Most passengers boarded at Southampton, where males were about double the number of females.
In [39]:
# quick look at passenger embarkation by class
passenger_by_class_who = titanic.groupby(['Class','Embarked'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Embarkation by Class')
# show the counts of passengers
ax = sns.countplot(x = 'Class', hue = 'Embarked', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 10, '%1.2f'%((height*100)/passengercount))
Almost all passengers who boarded at Queenstown travelled in third class.
In [40]:
# quick check of fares by port of embarkation
group_by_ses_embark = titanic.groupby(['Embarked'])
print(group_by_ses_embark['Fare'].agg([np.sum, np.max, np.min, np.mean, np.std]))
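The same kind of summary can be requested with string names for the aggregations, which gives plain column labels such as `max` and `min`. This is a sketch on made-up fares (an assumption), not the real data:

```python
import pandas as pd

# Hypothetical fares for three ports.
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', 'S', 'Q'],
    'Fare': [10.0, 30.0, 5.0, 15.0, 8.0],
})

# One row per port, one column per statistic.
fare_stats = toy.groupby('Embarked')['Fare'].agg(['sum', 'max', 'min', 'mean', 'std'])
print(fare_stats)
```

A port with a single passenger has no sample standard deviation, so its `std` entry is missing.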
In [41]:
# plot distribution of ticket price
ax = sns.FacetGrid(titanic, col = 'Class', size = 5, aspect = 0.7)
(ax.map(sns.boxplot, 'Embarked', 'Fare')
.despine(left = True));
In [42]:
# check passengers fare for class
ax = sns.FacetGrid(titanic, hue = 'Class', aspect = 3)
ax.map(sns.kdeplot,'Fare', shade = True)
oldest = titanic['Fare'].max()
ax.set(xlim = (0,oldest))
ax.add_legend()
# set title
ax = sns.plt.title('Ticket fare by class')
First-class fares were more expensive than those of the other classes.
In [43]:
# check passengers fare by embarked
ax = sns.FacetGrid(titanic, hue = 'Embarked', aspect = 3)
ax.map(sns.kdeplot,'Fare',shade = True)
oldest = titanic['Fare'].max()
ax.set(xlim = (0,oldest))
ax.add_legend()
ax = sns.plt.title('Ticket fare by port of embarked')
In [44]:
# check passengers fare by passenger type
ax = sns.FacetGrid(titanic, hue = 'Who', aspect = 3)
ax.map(sns.kdeplot,'Fare',shade = True)
oldest = titanic['Fare'].max()
ax.set(xlim = (0,oldest))
ax.add_legend()
ax = sns.plt.title('Ticket fare by passenger')
Cheap fares were mostly sold in Queenstown, which suggests that the passengers boarding there were less wealthy.
In [45]:
# create a new dataframe that removes rows with a missing cabin
titanic_no_missing = titanic.dropna()
In [46]:
titanic_no_missing.info()
In [52]:
# keep only the first letter of each cabin as the cabin position
for i, x in titanic_no_missing['Cabin'].iteritems():
    titanic_no_missing.loc[i,('Cabin')] = x[0]
#print titanic_no_missing['Cabin']
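Taking the first letter of each cabin can also be written without a loop, using the pandas string accessor. A minimal sketch on made-up cabin values; the column name `Deck` is hypothetical, whereas the cell above overwrites "Cabin" in place.

```python
import pandas as pd

# Hypothetical cabin strings (assumption); one entry lists two cabins.
toy = pd.DataFrame({'Cabin': ['C85', 'E46', 'B57 B59', 'T']})

# .str[0] takes the first character of every string in the column.
toy['Deck'] = toy['Cabin'].str[0]
print(toy)
```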
In [53]:
# Quick preview of the cabin position
titanic_no_missing.head()
Out[53]:
In [54]:
# get total cabin passenger
cabincount = float(len(titanic_no_missing))
In [55]:
# quick look at cabin rows
passenger_by_class_who = titanic_no_missing.groupby('Cabin')
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passenger cabin')
# show the counts of passengers
ax = sns.countplot(x = 'Cabin',data = titanic_no_missing, palette='summer')
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height, '%1.2f'%((height*100)/cabincount))
In [56]:
# cabin row T stands apart from the other groups, so redefine the cabin data without it
titanic_no_missing = titanic_no_missing[titanic_no_missing['Cabin'] != 'T']
In [57]:
# plot cabin position
passenger_by_class_who = titanic_no_missing.groupby('Cabin')
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passenger cabin')
# show the counts of passengers
ax = sns.countplot(x = 'Cabin',data = titanic_no_missing, palette='summer')
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height, '%1.2f'%((height*100)/cabincount))
In [58]:
# check cabin position by passenger class
passenger_by_class_who = titanic_no_missing.groupby(['Class', 'Cabin'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Passenger cabin by Class')
# show the counts of passengers
ax = sns.countplot(x = 'Class', hue = 'Cabin', data = titanic_no_missing)
# add percentage for each group, skipping empty bars
for p in ax.patches:
    height = p.get_height()
    if math.isnan(height):
        continue
    else:
        ax.text(p.get_x(), height, '%1.2f'%((height*100)/cabincount))
In [59]:
# check cabin position by port of embarkation
passenger_by_class_who = titanic_no_missing.groupby(['Embarked', 'Cabin'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Passenger cabin by Embarked')
# show the counts of passengers
ax = sns.countplot(x = 'Embarked', hue = 'Cabin', data = titanic_no_missing)
# add percentage for each group, skipping empty bars
for p in ax.patches:
    height = p.get_height()
    if math.isnan(height):
        continue
    else:
        ax.text(p.get_x(), height, '%1.2f'%((height*100)/cabincount))
In [60]:
# check cabin position by passenger type
passenger_by_class_who = titanic_no_missing.groupby(['Who', 'Cabin'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Passenger cabin by passengers')
# show the counts of passengers
ax = sns.countplot(x = 'Who', hue = 'Cabin', data = titanic_no_missing)
# add percentage for each group, skipping empty bars
for p in ax.patches:
    height = p.get_height()
    if math.isnan(height):
        continue
    else:
        ax.text(p.get_x(), height, '%1.2f'%((height*100)/cabincount))
From the data, almost all first-class passengers were in cabin rows A, B, C and D, and almost all passengers from Queenstown were in cabin rows E and F.
In [61]:
# check to make sure it worked
titanic['Family'].count()
Out[61]:
In [62]:
# plot passengers travelling alone or with family
passenger_by_class_who = titanic.groupby('Family')
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passenger family')
# show the counts of passengers
ax = sns.countplot(x = 'Family',data = titanic, palette='summer')
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 5, '%1.2f'%((height*100)/passengercount))
It looks like most of the passengers were travelling alone.
In [63]:
# plot family status by passenger type
passenger_by_class_who = titanic.groupby(['Who', 'Family'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passenger family by passenger')
# show the counts of passengers
ax = sns.countplot(x = 'Who', hue = 'Family', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 2, '%1.2f'%((height*100)/passengercount))
In [64]:
# plot family status by port of embarkation
passenger_by_class_who = titanic.groupby(['Embarked', 'Family'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passenger family by embarkation')
# show the counts of passengers
ax = sns.countplot(x = 'Embarked', hue = 'Family', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 1, '%1.2f'%((height*100)/passengercount))
In [65]:
# plot family status by passenger class
passenger_by_class_who = titanic.groupby(['Class', 'Family'])
print(passenger_by_class_who.count()['Passenger'])
# set plot title
ax = sns.plt.title('Titanic passenger family by Class')
# show the counts of passengers
ax = sns.countplot(x = 'Class', hue = 'Family', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 1, '%1.2f'%((height*100)/passengercount))
From this result, most passengers travelling alone were male. Passengers who boarded at Southampton mostly travelled alone.
In [66]:
# check survival rate
passenger_by_class_who = titanic.groupby('Passenger')
print(passenger_by_class_who.count()['Survived'])
# set plot title
ax = sns.plt.title('Titanic survival passenger')
# show the counts of passengers
ax = sns.countplot(x = 'Passenger', data = titanic)
# add percentage for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 2, '%1.2f'%((height*100)/passengercount))
It looks like more passengers died than survived.
Let's take a look at which variables had an effect on the survival rate.
In [67]:
# plot survival probability against several variables
ax = sns.factorplot(x = 'Class', y = 'Survived',
hue = 'Who', col = 'Embarked',
data = titanic, size = 5, aspect = .6)
The survival rates for men are much lower than for women and children, across all classes and ports of embarkation. Third-class women appear to have a lower survival rate than women in the other classes. Second-class men appear to have a very low survival rate compared with men in the other classes.
The data show that being male, in any class, dramatically decreases the chance of survival.
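The rates read off the plot can be checked numerically: because "Survived" is coded 0/1, its mean within a group is that group's survival rate. A sketch on made-up rows (an assumption, not the real data):

```python
import pandas as pd

# Hypothetical passengers: sex, class and survival flag.
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass': [1, 3, 1, 3, 3, 3],
    'Survived': [1, 0, 1, 1, 0, 0],
})

# Mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()

# Pivot classes into columns for a compact sex-by-class table.
rate_table = rate.unstack('Pclass')
print(rate_table)
```

On the notebook's dataframe the same grouping could use 'Who' and 'Class' in place of the toy columns.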
In [68]:
# set range of age for linear plot
age_range = [10,20,40,60,80]
In [69]:
# looking age versus survival by linear plot
ax = sns.lmplot('Age','Survived',
data = titanic, palette = 'winter',
x_bins = age_range)
It looks like there is a general trend: the older the passenger, the lower the survival rate.
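A numeric counterpart to the binned plot is to cut ages into bands and average "Survived" per band. This is only a sketch: the rows and the band edges in `age_edges` are made up, and `pd.cut` uses interval edges, which is not necessarily how the plot's `x_bins` groups points.

```python
import pandas as pd

# Hypothetical passengers (assumption).
toy = pd.DataFrame({
    'Age': [5, 15, 25, 35, 45, 70],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Hypothetical band edges: (0, 20], (20, 40], (40, 60], (60, 80].
age_edges = [0, 20, 40, 60, 80]
toy['AgeBand'] = pd.cut(toy['Age'], bins=age_edges)

# Survival rate within each age band.
rate_by_band = toy.groupby('AgeBand', observed=True)['Survived'].mean()
print(rate_by_band)
```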
In [70]:
# how does the survival rate relate to class and age?
# use a linear plot of age versus survival for each class
ax = sns.lmplot('Age', 'Survived', hue ='Class',
data = titanic, palette = 'winter',
x_bins = age_range)
This result shows that being in first class, at any age, dramatically increases the chance of survival.
In [71]:
# what about if relate gender and age effect survival rate?
ax = sns.lmplot('Age', 'Survived', hue = 'Sex',
data = titanic, palette = 'winter',
x_bins = age_range)
It looks like being male, at any age, lowers the chance of survival.
Now let's investigate how ticket fare relates to the survival rate.
In [72]:
# set range of price
farerange = [0,250,500,750,1000]
In [73]:
# plot data of fare versus survival
ax = sns.lmplot('Fare', 'Survived', data = titanic,
palette = 'winter', x_bins = farerange)
It looks like a higher fare increases the chance of survival; this may be related to first-class passengers having a higher survival rate.
In [74]:
# check survived rate relate fare by gender
ax = sns.lmplot('Fare', 'Survived', hue = 'Sex', col = 'Class',
data = titanic, palette = 'winter',
x_bins = farerange)
In [75]:
# check survival rate against fare by gender and port of embarkation
ax = sns.lmplot('Fare', 'Survived', hue = 'Sex', col = 'Embarked',
data = titanic, palette = 'winter',
x_bins = farerange)
From the result, it seems that a higher fare may slightly increase the chance of survival.
Now, does travelling with family increase the chance of survival?
In [76]:
# how about survival rate if relate with family or alone
ax = sns.factorplot(x = 'Family', y = 'Survived',
data = titanic, size = 5, aspect = .6)
Wow, an interesting result: passengers travelling with family show a higher survival rate than those travelling alone.
In [77]:
# how about survival rate if relate with family or alone
ax = sns.factorplot(x = 'Family', y = 'Survived',
hue = 'Class', col = 'Embarked',
data = titanic, size = 5, aspect = .6)
In [78]:
# how about survival rate of each gender who with family or alone
ax = sns.factorplot(x = 'Family', y = 'Survived',
hue = 'Sex', col = 'Embarked',
data = titanic, size = 5, aspect = .6)
It looks like passengers who travelled with family had a higher survival rate, even among males.
Awesome! We've gotten some really great insights on how gender, age, and class all relate to a passenger's chance of survival. Now you take control: answer the following questions using pandas and seaborn: 1.) Did the deck have an effect on the passengers' survival rate? Did this answer match up with your intuition? 2.) Did having a family member increase the odds of surviving the crash? Feel free to post a discussion if you get stuck or have more ideas!
The cabin position data remains: is it related to survival?
In [79]:
# plot survival by cabin position
passenger_by_class_who = titanic_no_missing.groupby('Cabin')
print(passenger_by_class_who.count()['Survived'])
# set plot title
ax = sns.plt.title('Titanic survival by cabin')
# show the counts of passengers
ax = sns.countplot(x = 'Cabin', hue = 'Passenger', data = titanic_no_missing)
# add percentage (of all passengers) for each group
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x(), height + 2, '%1.2f'%((height*100)/passengercount))
In [80]:
# how about survival rate of each gender who with family or alone
ax = sns.factorplot(x = 'Cabin', y = 'Survived',
hue = 'Who', col = 'Class',
data = titanic_no_missing, size = 5, aspect = .6)
From the data, it looks like cabin rows B, C and D have a higher survival rate; this may be because most first-class passengers were in these cabins.
- Mostly first-class passengers purchased the more expensive tickets
- Passengers from Cherbourg purchased the more expensive tickets

| Embarked | sum | amax | amin | mean | std |
|---|---|---|---|---|---|
| Cherbourg | 10072.2962 | 512.3292 | 4.0125 | 59.954144 | 83.912994 |
| Southampton | 17599.3988 | 263.0000 | 0.0000 | 27.243651 | 35.952905 |
| Queenstown | 1022.2543 | 90.0000 | 6.7500 | 13.276030 | 14.188047 |

- The most important factor for survival is gender: being female increases the chance of survival.
- The second factor is age: younger passengers may have a higher chance of survival.
- The third factor is passenger class: first-class passengers had a better chance of surviving, which is also related to cabin position and fare.
- https://www.kaggle.com/c/titanic
- https://stanford.edu/~mwaskom/software/seaborn/tutorial.html
- http://pandas.pydata.org/pandas-docs/stable/10min.html
- http://pandas.pydata.org/pandas-docs/stable/tutorials.html
- http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html
- https://docs.scipy.org/doc/numpy-dev/user/quickstart.html